Unsupervised learning

Assignment 2

Data Set
https://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work

1. Data Exploration

2. Data Visualization

3. Outlier Detection

NORMALIZATION
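
A minimal sketch of this step, assuming min-max scaling to [0, 1]; the scaler actually used in the notebook is not shown here. The small array `X` is a hypothetical stand-in for the numeric columns of the data set.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the numeric feature matrix (rows = records).
X = np.array([
    [36.0, 13.0, 90.0],
    [118.0, 18.0, 98.0],
    [179.0, 51.0, 89.0],
    [279.0, 5.0, 68.0],
])

# Rescale every column independently so its minimum maps to 0 and its maximum to 1.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.round(3))
```

Distance-based methods such as K-means and DBSCAN are sensitive to feature scale, which is why a step like this usually comes before clustering.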

4. Feature Selection

1. Using Random Forest importances:
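
A rough sketch of how impurity-based importances could be read from a fitted forest. Synthetic regression data stands in for the real features and target, and the `feature_{i}` names and hyperparameters are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: 8 features, only 3 of which carry signal.
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=10.0, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X, y)

# feature_importances_ holds the normalized impurity-based importance per feature.
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print(f"{name}: {score:.3f}")
```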

2. Using permutation importance:

All attributes with a non-zero permutation importance are selected.
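
The selection rule can be sketched as follows, using `sklearn.inspection.permutation_importance` on a held-out split. The synthetic data, the random forest model, and the `feature_{i}` names are assumptions made for illustration, not the notebook's actual setup.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 8 features, only 3 of which carry signal.
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=10.0, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each column in turn and measure how much the held-out score drops.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)

# Keep every feature whose mean importance is non-zero, as described above.
selected = [name for name, mean in zip(feature_names, result.importances_mean)
            if mean != 0]
print(selected)
```

Note that a non-zero test also keeps features with a negative mean importance; a stricter variant would keep only positive values.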

5. Clustering

1. K-means:

Let's look for possible elbow points by plotting the results for several values of k. We can use t-SNE to visualize the 9D data in 3D space.

see https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
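
A minimal sketch of both ideas: K-means inertia over a range of k (the quantity usually plotted to look for an elbow) and a 3D t-SNE embedding of 9-dimensional data. The blob data, the range of k, and the perplexity are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in for the 9 selected features.
X, _ = make_blobs(n_samples=200, n_features=9, centers=4, random_state=0)

# Inertia (within-cluster sum of squares) for each candidate k.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"k={k}  inertia={value:.1f}")

# Project the 9D points to 3D for visual inspection only.
embedding = TSNE(n_components=3, perplexity=30, random_state=0).fit_transform(X)
print(embedding.shape)
```

Plotting `inertias` against k gives the elbow curve, and the three columns of `embedding` can be passed to a 3D scatter plot colored by cluster label.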

It seems like 4 clusters work best. Let's try the same process again with PCA + K-means.
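
One way PCA + K-means could be chained is a scikit-learn pipeline. This is a sketch on synthetic data; the number of principal components (3) and the standardization step are assumptions, while `n_clusters=4` follows the observation above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 9 selected features.
X, _ = make_blobs(n_samples=200, n_features=9, centers=4, random_state=0)

# Standardize, reduce to 3 principal components, then cluster in the reduced space.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=3),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)

# Share of variance kept by the retained components.
explained = pipeline.named_steps["pca"].explained_variance_ratio_.sum()
print(f"explained variance kept: {explained:.2f}")
```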

2. AGNES:

It seems that 3, 6, or 9 clusters work best among the candidates (clusters of roughly equal size and a significant silhouette value). We chose to work with n_clusters = 3.
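
The comparison described above (silhouette value and cluster sizes per candidate) can be sketched like this. AGNES is represented here by `AgglomerativeClustering`; the Ward linkage, the range of k, and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the 9 selected features.
X, _ = make_blobs(n_samples=200, n_features=9, centers=3, random_state=0)

results = {}
for k in range(2, 10):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    # Store the mean silhouette value and the number of points per cluster.
    results[k] = (silhouette_score(X, labels), np.bincount(labels))

for k, (score, sizes) in results.items():
    print(f"k={k}  silhouette={score:.3f}  sizes={sizes.tolist()}")
```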

Let's plot the complete dendrogram to see the possible clusterings.
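
A sketch of building the full merge tree with SciPy. Ward linkage and the synthetic data are assumptions; `no_plot=True` is used here only so the example runs without a display, and dropping it draws the dendrogram with matplotlib.

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic stand-in for the 9 selected features.
X, _ = make_blobs(n_samples=60, n_features=9, centers=3, random_state=0)

# Each row of Z records one merge: the two clusters joined, their distance,
# and the size of the new cluster.
Z = linkage(X, method="ward")

# Compute the dendrogram layout without drawing it.
tree = dendrogram(Z, no_plot=True)

# Cut the tree so that at most 3 flat clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")
print(sorted(set(labels)))
```

Reading the dendrogram from the top, long vertical gaps between merges suggest natural places to cut the tree.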

3. DBSCAN:
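
A minimal DBSCAN sketch on synthetic 2D data. The values of `eps` and `min_samples` are illustrative assumptions and would need tuning on the real, normalized features.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with three dense groups.
X, _ = make_blobs(n_samples=300, centers=[[1, 1], [-1, -1], [1, -1]],
                  cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

# eps is the neighborhood radius; min_samples is the density threshold
# a point must reach to count as a core point.
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# DBSCAN marks noise points with the label -1.
n_noise = int(np.sum(labels == -1))
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```

Unlike K-means and AGNES, DBSCAN does not take the number of clusters as input; it follows from the density parameters.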

6. Evaluation Metrics
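
A sketch of three internal metrics that scikit-learn provides for clusterings without ground-truth labels. Which metrics the notebook actually reports is not shown here, so this selection is an assumption, as are the synthetic data and the K-means labels being scored.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

# Synthetic stand-in for the 9 selected features.
X, _ = make_blobs(n_samples=200, n_features=9, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Silhouette: in [-1, 1], higher is better.
silhouette = silhouette_score(X, labels)
# Davies-Bouldin: non-negative, lower is better.
davies_bouldin = davies_bouldin_score(X, labels)
# Calinski-Harabasz: positive, higher is better.
calinski_harabasz = calinski_harabasz_score(X, labels)

print(f"silhouette={silhouette:.3f}")
print(f"davies_bouldin={davies_bouldin:.3f}")
print(f"calinski_harabasz={calinski_harabasz:.1f}")
```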